library(MASS)
?crabs # read the help page to find out about the dataset
X1 = crabs |> dplyr::select(CL, CW) |> as.matrix()
Y1 = crabs |> dplyr::select(FL, RW, BD) |> as.matrix()
Task 1
Consider again the crabs dataset you looked at in the exercises in the chapter on PCA. We now consider a canonical correlation analysis in which one set of variables, the \(\mathbf x\)-set, is given by CL and CW and the other set, the \(\mathbf y\)-set, is given by FL, RW and BD.
Sxx=cov(X1)
Syy=cov(Y1) # find y-variable variance matrix
Sxy=cov(X1, Y1) # find cross-covariance matrix
- calculate \({\bf S}_{\bf x x}^{-1/2}\) and \({\bf S}_{\bf yy}^{-1/2}\) by first computing the spectral decomposition of \(\mathbf S_{xx}\) and \(\mathbf S_{yy}\).
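One way to do this is with eigen. The sketch below repeats the setup from above (using base subsetting rather than dplyr) so that the chunk is self-contained; the names Sxx.eig and Syy.eig are our own, while Sxx.isqrt and Syy.isqrt are the names used later in this solution.

```r
# Inverse square roots via the spectral decomposition S = V diag(lambda) V^T,
# so that S^{-1/2} = V diag(1/sqrt(lambda)) V^T.
# Setup repeated from above so this chunk is self-contained.
library(MASS)
X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])
Sxx <- cov(X1)
Syy <- cov(Y1)

Sxx.eig <- eigen(Sxx)  # spectral decomposition of Sxx
Sxx.isqrt <- Sxx.eig$vectors %*% diag(1/sqrt(Sxx.eig$values)) %*% t(Sxx.eig$vectors)
Syy.eig <- eigen(Syy)  # spectral decomposition of Syy
Syy.isqrt <- Syy.eig$vectors %*% diag(1/sqrt(Syy.eig$values)) %*% t(Syy.eig$vectors)
```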
We can check that this is correct by verifying that we get the identity matrix when expected:
[,1] [,2]
CL 1.000000e+00 1.953993e-14
CW -4.440892e-15 1.000000e+00
[,1] [,2]
[1,] 1.000000e+00 6.217249e-15
[2,] 6.217249e-15 1.000000e+00
- Now calculate the matrix \(\mathbf Q\) and compute its singular value decomposition.
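In the notation of the notes this is presumably \(\mathbf Q = \mathbf S_{xx}^{-1/2}\mathbf S_{xy}\mathbf S_{yy}^{-1/2}\), whose singular values are the canonical correlations. A self-contained sketch (the helper function isqrt is our own):

```r
# Form Q = Sxx^{-1/2} Sxy Syy^{-1/2} and take its singular value decomposition.
# Setup repeated so the chunk is self-contained.
library(MASS)
X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])
isqrt <- function(S) {  # inverse square root via the spectral decomposition
  e <- eigen(S)
  e$vectors %*% diag(1/sqrt(e$values)) %*% t(e$vectors)
}
Sxx.isqrt <- isqrt(cov(X1))
Syy.isqrt <- isqrt(cov(Y1))
Q <- Sxx.isqrt %*% cov(X1, Y1) %*% Syy.isqrt
Q.svd <- svd(Q)  # singular values d are the canonical correlations
Q.svd
```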
$d
[1] 0.9901260 0.4836468
$u
[,1] [,2]
[1,] -0.7894005 -0.6138785
[2,] -0.6138785 0.7894005
$v
[,1] [,2]
[1,] -0.6490243 -0.2297495
[2,] -0.3210785 0.9469680
[3,] -0.6896927 -0.2246480
U=Q.svd$u # extract the U matrix (our notation) in svd
V=Q.svd$v # extract the V matrix (our notation) in svd
d=Q.svd$d # extract the singular values in svd
- Compute the first pair of CC vectors and CC variables \(\eta_1\) and \(\psi_1\). What is the 1st canonical correlation?
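The first pair can be computed in the same way as the second pair later in this solution; a self-contained sketch:

```r
# First CC vectors a1, b1 and CC scores eta1, psi1.
# Setup repeated so the chunk is self-contained.
library(MASS)
X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])
isqrt <- function(S) {  # inverse square root via the spectral decomposition
  e <- eigen(S)
  e$vectors %*% diag(1/sqrt(e$values)) %*% t(e$vectors)
}
Sxx.isqrt <- isqrt(cov(X1))
Syy.isqrt <- isqrt(cov(Y1))
Q.svd <- svd(Sxx.isqrt %*% cov(X1, Y1) %*% Syy.isqrt)
a1 <- Sxx.isqrt %*% Q.svd$u[, 1]  # optimal x-weights
b1 <- Syy.isqrt %*% Q.svd$v[, 1]  # optimal y-weights
eta1 <- sweep(X1, 2, colMeans(X1)) %*% a1  # centred x-scores for 1st CC
psi1 <- sweep(Y1, 2, colMeans(Y1)) %*% b1  # centred y-scores for 1st CC
```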
The first canonical correlation is the correlation between the first pair of CC vectors, which is the same as the first singular value.
cor(eta1, psi1)
[,1]
[1,] 0.990126
d[1]
[1] 0.990126
- Plot \(\psi_1\) vs \(\eta_1\). What does the plot tell you (if anything)?
plot(eta1,psi1) # plot x-scores against y-scores for 1st CC
This just shows the high correlation (as expected from the first canonical correlation) between \(\eta_1\) and \(\psi_1\).
- Repeat the above to find the second pair of CC vectors, and the second set of CC variables/scores, and plot these against each other and against the first CC scores. Is there any interesting structure in any of the plots? Which plots suggest random scatter?
a=Sxx.isqrt%*%U[,2] # calculate optimal x-weights
b=Syy.isqrt%*%V[,2] # calculate optimal y-weights
eta2=sweep(X1, 2, colMeans(X1))%*%a # calculate centred x-scores for 2nd CC
psi2=sweep(Y1, 2, colMeans(Y1))%*%b # calculate centred y-scores for 2nd CC
cor(eta2, psi2)
[,1]
[1,] 0.4836468
d[2]
[1] 0.4836468
If we plot eta2 against psi2 we can see some weak correlation, which fits with the second canonical correlation being 0.48. If we plot eta2 against eta1 or psi1, we can see there is no correlation, as expected from the definition of the CC variables.
- Finally, repeat the analysis above using the cc command and plt.cc from the package CCA, which you will need to download.
Note that circle plots are explained below in the solution to task 2.
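A sketch of the cc and plt.cc calls; the object name cca matches the one used in the output later in this solution, and assumes X1 and Y1 are the matrices constructed at the start of the task.

```r
# Repeat the analysis with the CCA package (install.packages("CCA") if needed)
library(CCA)
cca <- cc(X1, Y1)   # canonical correlation analysis of the two variable sets
cca$cor             # the canonical correlations
plt.cc(cca, var.label = TRUE, type = 'v')  # circle plot of the variables
```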
So we can see the first cc scores are highly negatively correlated with all of the respective original variables. This is because they are both just measuring the overall size of the crabs. If CL and CW are large, the RW, FL, and BD tend to be large.
The second set of CC scores is only weakly correlated with the variables in the opposing dataset: i.e., \(\eta_2\), the second CC score for the \(X\) variables, is only weakly correlated with the \(Y\) variables.
cca$scores$corr.Y.xscores[,2]
FL RW BD
-0.02732000 -0.22740190 -0.01305216
plot(cca$scores$xscores[,1], cca$scores$xscores[,2], col=crabs$sp)
plot(cca$scores$yscores[,1], cca$scores$yscores[,2], col=crabs$sex)
plot(cca$scores$xscores[,1], cca$scores$yscores[,2], col=crabs$sex)
plot(cca$scores$xscores[,2], cca$scores$yscores[,1], col=crabs$sp)
Task 2
The data for previous Premier League seasons is available at:
https://www.rotowire.com/soccer/league-table.php?season=2022
There is a button to download the csv (comma separated values) file in the bottom right hand corner. Read the data into R (hint: try the read.csv command).
If you are not sure what the name of YOURDIRECTORY is where the file is located, then a useful command to find out is file.choose()
- Reproduce the analysis from the notes for the 2022-23 premier league season. Does the analysis produce a similar result as for the 2019-20 table?
library(dplyr)
prem2223 <- read.csv('prem202223.csv')
# the data can be downloaded from https://www.rotowire.com/soccer/league-table.php
table <- prem2223 |> dplyr::select(Team, W, D, L, G, GA, GD)
X <- table[,c('W','D')] # W and D
Y <- table[,c('G','GA')] # G and GA
rownames(X)<- prem2223$Team
rownames(Y)<- prem2223$Team
library(CCA)
(prem.cca <- cc(X,Y))
$cor
[1] 0.9671956 0.4452870
$names
$names$Xnames
[1] "W" "D"
$names$Ynames
[1] "G" "GA"
$names$ind.names
[1] "Manchester City" "Arsenal" "Manchester United"
[4] "Newcastle United" "Liverpool" "Brighton & Hove Albion"
[7] "Aston Villa" "Tottenham Hotspur" "Brentford"
[10] "Fulham" "Crystal Palace" "Chelsea"
[13] "Wolverhampton" "West Ham United" "AFC Bournemouth"
[16] "Nottingham Forest" "Everton" "Leicester City"
[19] "Leeds United" "Southampton"
$xcoef
[,1] [,2]
W -0.1676667 0.01115769
D -0.1015198 0.36320806
$ycoef
[,1] [,2]
G -0.03054968 -0.05935469
GA 0.04392604 -0.08569373
$scores
$scores$xscores
[,1] [,2]
Manchester City -1.8627265 -1.1949147
Arsenal -1.6289130 -0.8540220
Manchester United -1.1259130 -0.8874951
Newcastle United -1.2674051 1.9735387
Liverpool -0.8613258 0.5207064
Brighton & Hove Albion -0.4906194 -0.2168674
Aston Villa -0.3890996 -0.5800754
Tottenham Hotspur -0.2875797 -0.9432835
Brentford -0.5967385 1.9289079
Fulham 0.1139004 -0.6135485
Crystal Palace 0.2769678 1.1578610
Chelsea 0.3784877 0.7946530
Wolverhampton 0.6830472 -0.2949712
West Ham United 0.7845670 -0.6581793
AFC Bournemouth 0.8860869 -1.0213873
Nottingham Forest 0.7138210 0.7723376
Everton 0.7799678 1.1243880
Leicester City 1.1199004 -0.6804946
Leeds United 1.1506741 0.3868142
Southampton 1.6229003 -0.7139677
$scores$yscores
[,1] [,2]
Manchester City -2.14710923 -0.54560978
Arsenal -1.52455079 -1.04641890
Manchester United -0.60806040 0.73422192
Newcastle United -1.35281756 0.99761226
Liverpool -0.95170080 -0.61758279
Brighton & Hove Albion -0.59649554 -0.95368108
Aston Villa -0.26243453 0.89262359
Tottenham Hotspur -0.09613581 -1.69190898
Brentford -0.47628229 0.47714074
Fulham -0.07715099 0.05534872
Crystal Palace 0.20539006 1.28844404
Chelsea 0.17863735 1.57854089
Wolverhampton 0.87567151 1.05139274
West Ham United 0.40784692 0.65557229
AFC Bournemouth 1.26341191 -0.41875390
Nottingham Forest 1.10108412 -0.22102741
Everton 0.74009643 0.95902238
Leicester City 0.70393828 -0.99263843
Leeds United 1.23484769 -1.67151163
Southampton 1.38181366 -0.53078666
$scores$corr.X.xscores
[,1] [,2]
W -0.96308675 -0.2691912
D 0.06639999 0.9977931
$scores$corr.Y.xscores
[,1] [,2]
G -0.8607069 -0.2031206
GA 0.8599713 -0.2037799
$scores$corr.X.yscores
[,1] [,2]
W -0.93149323 -0.1198673
D 0.06422178 0.4443043
$scores$corr.Y.yscores
[,1] [,2]
G -0.8898995 -0.4561566
GA 0.8891390 -0.4576371
- Give an interpretation of the CC scores. One way of doing this is to think about the correlation between the original variables and the scores (the transformed variables). Note that there are four different correlation matrices we can look at to aid interpretation: correlation between \(X\) and \(\eta\), \(X\) and \(\psi\), \(Y\) and \(\eta\), and \(Y\) and \(\psi\).
Circle plots can also help. Look at the help page for plt.cc and try some circle plots.
We can use the correlations between the original and transformed variables to interpret the CC scores. These correlations are automatically computed when you do CCA in R. So for example, the correlation between X and \(\eta\) is stored in
# correlation between X and transformed x scores (eta)
prem.cca$scores$corr.X.xscores
[,1] [,2]
W -0.96308675 -0.2691912
D 0.06639999 0.9977931
cor(X, prem.cca$scores$xscores)
[,1] [,2]
W -0.96308675 -0.2691912
D 0.06639999 0.9977931
The 3 other correlations of interest are
# correlation between X and transformed y scores (psi)
prem.cca$scores$corr.X.yscores
[,1] [,2]
W -0.93149323 -0.1198673
D 0.06422178 0.4443043
# correlation between Y and transformed x scores (eta)
prem.cca$scores$corr.Y.xscores
[,1] [,2]
G -0.8607069 -0.2031206
GA 0.8599713 -0.2037799
# correlation between Y and transformed y scores (psi)
prem.cca$scores$corr.Y.yscores
[,1] [,2]
G -0.8898995 -0.4561566
GA 0.8891390 -0.4576371
Circle plots are useful ways of visualizing these correlations and can help us interpret the CCA output. They plot the correlation between the original \(Y\) variables and the first two X CC scores (in blue) and the correlation between the original \(X\) variables and the first two Y CC scores (in red). I.e. they plot
prem.cca$scores$corr.Y.xscores
[,1] [,2]
G -0.8607069 -0.2031206
GA 0.8599713 -0.2037799
prem.cca$scores$corr.X.yscores
[,1] [,2]
W -0.93149323 -0.1198673
D 0.06422178 0.4443043
plt.cc(prem.cca, var.label = TRUE, type='v', ind.names = table$Team)
So we can see that the first CC score for X (\(\eta_1\)) is highly negatively correlated with \(W\) and \(G\), and positively correlated with GA. This makes sense! If a team wins a lot we expect it to score a lot (large \(G\)) and concede rarely (low \(GA\)).
The second CC scores are less interesting here (because we only have 2 variables in X and Y, the second CC scores essentially are left to explain the remaining variability in the data). We can see that \(\eta_2\) is strongly correlated with \(D\), but unrelated to \(G\) and \(GA\).
To plot the correlation of the original variables with \(\psi\) (the \(Y\) CC scores), we have to change the order of the arguments to cc.
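For example (the object name prem.cca.swap is our own):

```r
# Swap the roles of the two sets to get the other circle plot;
# prem.cca.swap is a hypothetical name, and X, Y are as constructed above
prem.cca.swap <- cc(Y, X)
plt.cc(prem.cca.swap, var.label = TRUE, type = 'v', ind.names = table$Team)
```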
Swapping \(X\) and \(Y\) doesn’t change the underlying analysis; it just allows us to plot the other circle plot. We can see that \(\psi_1\) (the first CC \(Y\) score) is strongly positively correlated with GA, and negatively correlated with G and W. \(\psi_2\) is weakly correlated with D, G and GA. Note that due to a numerical quirk, the sign of the second score is reversed compared with when we did cc(X,Y).
So in the swapped analysis the first CC score for \(Y\) is highly negatively correlated with \(W\), whereas the first CC score for \(X\) is highly correlated with GA and negatively correlated with \(G\). The first canonical correlation is 0.967, which shows there is a very strong link between the first CC score for \(X\) and the first CC score for \(Y\).
The first pair of canonical correlation vectors are very similar to what we found for the 2019-20 season, with a similar first canonical correlation.
- Consider now doing CCA with \(\mathbf x=(W,D)\) and \(\mathbf y=(G,GA, L)\). Note that if you knew \(W\) and \(D\), you could calculate \(L\). Without doing any computation, what do you expect the first canonical correlation to be? What will the first pair of CC vectors be (up to a multiplicative constant)?
The first canonical correlation is the largest correlation between a linear combination of \(\mathbf x\) and a linear combination of \(\mathbf y\). Since \(W+D+L=38\), the linear combination \(W+D\) is perfectly (negatively) correlated with \(L\), so we expect the first canonical correlation to be 1, achieved with CC vectors \[\mathbf a_1 \propto(1\; 1)^\top \qquad \mathbf b_1 \propto (0\; 0\; 1)^\top\]
- Check your intuition by doing the calculation in R:
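A sketch of the check; the names X2 and Y2 are our own, while prem.cca2 is the object examined in the output that follows, and table is the data frame built above.

```r
# Add losses L to the y-set and redo the CCA
X2 <- table[, c('W', 'D')]        # x-set: wins and draws
Y2 <- table[, c('G', 'GA', 'L')]  # y-set: goals for, goals against, losses
prem.cca2 <- cc(X2, Y2)
```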
prem.cca2$cor
[1] 1.0000000 0.6504055
prem.cca2$xcoef
[,1] [,2]
W -0.1668436 -0.01999535
D -0.1668436 0.33821530
prem.cca2$ycoef
[,1] [,2]
G -5.551115e-17 -0.10374517
GA 6.245005e-17 0.06634862
L 1.668436e-01 -0.38272979
plt.cc(prem.cca2, type='v', var.label = TRUE)
Task 3
We will now look at data measured from 600 first year university students. Measurements were made on three psychological variables:
- Locus of Control: the degree to which someone believes that they, as opposed to external forces, have control over the outcome of events in their lives.
- Self Concept: an indication of whether a person tends to hold a generally positive and consistent or negative and variable self-view.
- Motivation: how motivated an individual is.
which will form our \(\mathbf X\) variables. The \(\mathbf Y\) variables are four academic scores (standardized test scores)
- Reading
- Writing
- Math
- Science
and gender (1=Male, 0=Female). We are interested in how the set of psychological variables relates to the academic variables and gender.
Conduct CCA on these data. Provide an interpretation of your results.
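A sketch of the analysis, assuming the data have been saved to a csv file. The file name 'psych_acad.csv' and the gender column name 'Gender' are assumptions; the other column names follow the output shown later in this solution, and the set names psych and acad match the cc(psych, acad) call mentioned below.

```r
library(CCA)
dat <- read.csv('psych_acad.csv')  # hypothetical file name
psych <- dat[, c('Control', 'Concept', 'Motivation')]            # psychological variables
acad  <- dat[, c('Read', 'Write', 'Math', 'Science', 'Gender')]  # 'Gender' column name is a guess
psych.cca <- cc(psych, acad)   # hypothetical object name
psych.cca$cor                  # canonical correlations
plt.cc(psych.cca, var.label = TRUE, type = 'v')
```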
Q3
The first canonical correlation is moderate and shows there is some link (but not a strong link) between some combination of the psych variables and the academic variables. The second and third canonical correlations are close to zero and so probably not worth looking at in detail.
We’re interested in the cross correlations: what the academic variables tell us about the psych variables, and what the psych variables tell us about the academic variables. Let’s look at these correlations.
We can visualize these on a circle plot.
Note: Because we did cc(psych, acad) this plots the correlation with the x-scores (\(\eta\)), i.e., the psych CC scores
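The relevant correlation matrices (using the hypothetical object name psych.cca for the cc(psych, acad) fit) are:

```r
psych.cca$scores$corr.X.xscores  # psych variables vs the psych CC scores
psych.cca$scores$corr.Y.xscores  # academic variables vs the psych CC scores
```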
[,1] [,2] [,3]
Control -0.91428518 0.3936580 0.09547756
Concept -0.09996773 0.4213083 -0.90139104
Motivation -0.58532551 -0.6061228 -0.53852502
[,1] [,2] [,3]
Read -0.3930571 0.03755922 -0.006144737
Write -0.4063141 -0.03388766 0.007646605
Math -0.3571452 0.02882125 -0.006381333
Science -0.3098729 0.10365354 0.005348465
If instead we swap the order of the data sets in the cc command, then plt.cc plots the correlations with the acad CC scores.
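That is (with the hypothetical object name acad.cca):

```r
# Same analysis with the two sets swapped, so the x-set is now the academic variables
acad.cca <- cc(acad, psych)
plt.cc(acad.cca, var.label = TRUE, type = 'v')
acad.cca$scores$corr.Y.xscores  # psych variables vs the academic CC scores
acad.cca$scores$corr.X.xscores  # academic variables vs the academic CC scores
```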
[,1] [,2] [,3]
Control -0.40817026 0.06037101 0.002148577
Concept -0.04462924 0.06461142 -0.020284434
Motivation -0.26131066 -0.09295440 -0.012118686
[,1] [,2] [,3]
Read -0.8804322 0.2449104 -0.2730572
Write -0.9101273 -0.2209695 0.3397966
Math -0.7999910 0.1879332 -0.2835709
Science -0.6941031 0.6758881 0.2376728
Either way, the interpretation is the same. There is a moderate strength link between locus of control and motivation, and performance in the academic subjects. Self concept appears to be unrelated to academic performance.